type signature
Music Arena: Live Evaluation for Text-to-Music
Kim, Yonghyun, Chi, Wayne, Angelopoulos, Anastasios N., Chiang, Wei-Lin, Saito, Koichi, Watanabe, Shinji, Mitsufuji, Yuki, Donahue, Chris
We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering *live* evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of *detailed* preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org . Preference data is available at: https://huggingface.co/music-arena .
Evaluating Program Semantics Reasoning with Type Inference in System F
He, Yifeng, Yang, Luning, Gonzalo, Christopher Castro Gaw, Chen, Hao
Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
Encoding architecture algebra
Bersier, Stephane, Chen-Lin, Xinyi
There is growing awareness of the importance of designing model architectures that capture and respect the distinct structure of input data. Many successful deep learning architectures, such as transformers [1], convolutional neural networks (CNNs) [2], graph neural networks (GNNs) [3], and recurrent neural networks (RNNs) [4], inherently incorporate aspects of data structure. Ongoing research focuses on refining existing architectures, as well as designing new ones for other types of structured data. For instance, DeepSets [5] are tailored to process sets, group and gauge equivariant CNNs [6][7] respect both global and local symmetries in the data, and strongly-typed RNNs [8] incorporate explicit types within recurrent networks. By accounting for the structure of the input data, these model architectures exhibit improved performance, better generalization with fewer parameters, and enhanced interpretability.
Test Case Features as Hyper-heuristics for Inductive Programming
Instruction subsets are heuristics that can reduce the size of the inductive programming search space by tens of orders of magnitude. Comprising many overlapping subsets of different sizes, they serve as predictions of the instructions required to code a solution for any problem. Currently, this approach employs a single, large family of subsets, meaning that some problems can search thousands of subsets before a solution is found. In this paper we introduce the use of test case type signatures as hyper-heuristics to select one of many smaller families of instruction subsets. The type signature for any set of test cases maps directly to a single family, and smaller families mean that fewer subsets need to be considered for most problems. Having many families also permits subsets to be reordered to better reflect their relative occurrence in human code, again reducing the search space size for many problems. Overall, the new approach can further reduce the size of the inductive programming search space by between 1 and 3 orders of magnitude, depending on the type signature. Larger and more consistent reductions are possible through the use of more sophisticated type systems. The potential use of additional test case features as hyper-heuristics and some other possible future work are also briefly discussed.
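The selection step described in this abstract, where a type signature derived from the test cases picks one family of instruction subsets, can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `type_signature`, `select_family`, and the contents of `FAMILIES` are invented for illustration and are not taken from the paper, which does not list its families or instruction names in this abstract.

```python
def type_signature(test_cases):
    """Derive ((input type names...), output type name) from (inputs, output) pairs."""
    signatures = {
        (tuple(type(arg).__name__ for arg in inputs), type(output).__name__)
        for inputs, output in test_cases
    }
    if len(signatures) != 1:
        raise ValueError("test cases disagree on their type signature")
    return signatures.pop()


# Hypothetical families: each signature maps to an ordered list of instruction
# subsets, with the subsets to try first placed first. The entries are invented.
FAMILIES = {
    (("int", "int"), "int"): [{"add", "mul"}, {"add", "sub", "mul", "div"}],
    (("list",), "int"): [{"len"}, {"loop", "add", "index"}],
}


def select_family(test_cases):
    """Use the test-case type signature as the key that selects one family."""
    return FAMILIES[type_signature(test_cases)]


# Two test cases for a function taking one list and returning an int.
cases = [(([1, 2, 3],), 3), (([],), 0)]
print(type_signature(cases))
print(select_family(cases))
```

The point of the sketch is only that the lookup is direct: one signature, one family, so the search never has to scan subsets belonging to unrelated signatures.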
A Formal Algebraic Framework for DSL Composition
Flores, Zachary, Taranto, Angelo, Bond, Eric
We discuss a formal framework for using algebraic structures to model a meta-language that can write, compose, and provide interoperability between abstractions of DSLs. The purpose of this formal framework is to provide a verification of compositional properties of the meta-language. Throughout our paper we discuss the construction of this formal framework, as well as its relation to our team's work on the DARPA V-SPELLS program via the pipeline we have developed for completing our verification tasking on V-SPELLS. We aim to give a broad overview of this verification pipeline in our paper. The pipeline can be split into four main components: the first is providing a formal model of the meta-language in Coq; the second is to give a specification in Coq of our chosen algebraic structures; third, we need to implement specific instances of our algebraic structures in Coq, as well as give a proof in Coq that this implementation is an algebraic structure according to our specification in the second step; and lastly, we need to give a proof in Coq that the formal model for the meta-language in the first step is an instance of the implementation in the third step.
Type-driven Neural Programming by Example
In this thesis we look into programming by example (PBE), which is about finding a program mapping given inputs to given outputs. PBE has traditionally seen a split between formal and neural approaches, where formal approaches typically involve deductive techniques such as SAT solvers and types, while the neural approaches involve training on sample input-outputs with their corresponding program, typically using sequence-based machine learning techniques such as LSTMs [41]. As a result of this split, programming types had yet to be used in neural program synthesis techniques. We propose a way to incorporate programming types into a neural program synthesis approach for PBE. We introduce the Typed Neuro-Symbolic Program Synthesis (TNSPS) method based on this idea, and test it in the functional programming context to empirically verify that type information may help improve generalization in neural synthesizers on limited-size datasets. Our TNSPS model builds upon the existing Neuro-Symbolic Program Synthesis (NSPS), a tree-based neural synthesizer combining information from input-output examples plus the current program, by further exposing information on the types of those input-output examples, of the grammar production rules, and of the hole that we wish to expand in the program. We further explain how we generated a dataset within our domain, which uses a limited subset of Haskell as the synthesis language. Finally, we discuss several topics of interest that may help take these ideas further. For reproducibility, we release our code publicly.
From Math To Machine
In this post I'm going to explore how a mathematical concept can be redefined in progressively more computer-oriented terms, all the way from high level languages down to machine code, ready for direct execution by a computer. To that end, I'm going to define the same logic in several different but related formats. If you're interested in how language styles can differ or curious about what your code might look like after being compiled, keep reading! A factorial is the product of an integer and all smaller integers greater than 0. There are lots of ways to describe a definition like this. This definition states that n! is the product of all integers from 1 to n. One important use of factorials is calculating the total number of permutations of a set. For example, the string "cat" can be rearranged in 6 possible ways: "cat", "act", "atc", "tac", "tca", and "cta". This string has 3 letters and 3! = 6. The string "a", which has one character, can only be arranged in that one way.
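As a quick check of the definition and the "cat" example, here is a minimal Python sketch. The function names `factorial` and `arrangements` are chosen for illustration and are not from the post; the standard library's `math.prod` and `itertools.permutations` do the work.

```python
from itertools import permutations
from math import prod


def factorial(n: int) -> int:
    """Product of all integers from 1 to n; the empty product gives 0! = 1."""
    if n < 0:
        raise ValueError("factorial is undefined for negative integers")
    return prod(range(1, n + 1))


def arrangements(s: str) -> list:
    """All distinct orderings of the characters in s, sorted alphabetically."""
    return sorted({"".join(p) for p in permutations(s)})


print(factorial(3))               # 6
print(arrangements("cat"))        # ['act', 'atc', 'cat', 'cta', 'tac', 'tca']
print(len(arrangements("a")))     # 1
```

For a string whose characters are all different, the number of arrangements equals the factorial of its length, which is why "cat" yields 3! = 6 orderings and "a" yields 1! = 1.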
Coarse-to-Fine Inference and Learning for First-Order Probabilistic Models
Kiddon, Chloe (University of Washington) | Domingos, Pedro (University of Washington)
Coarse-to-fine approaches use sequences of increasingly fine approximations to control the complexity of inference and learning. These techniques are often used in NLP and vision applications. However, no coarse-to-fine inference or learning methods have been developed for general first-order probabilistic domains, where the potential gains are even higher. We present our Coarse-to-Fine Probabilistic Inference (CFPI) framework for general coarse-to-fine inference for first-order probabilistic models, which leverages a given or induced type hierarchy over objects in the domain. Starting by considering the inference problem at the coarsest type level, our approach performs inference at successively finer grains, pruning high- and low-probability atoms before refining. CFPI can be applied with any probabilistic inference method and can be used in both propositional and relational domains. CFPI provides theoretical guarantees on the errors incurred, and these guarantees can be tightened when CFPI is applied to specific inference algorithms. We also show how to learn parameters in a coarse-to-fine manner to maximize the efficiency of CFPI. We evaluate CFPI with the lifted belief propagation algorithm on social network link prediction and biomolecular event prediction tasks. These experiments show CFPI can greatly speed up inference without sacrificing accuracy.
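The control flow described in this abstract (start at the coarsest type, prune what is already near-certain, refine the rest) can be sketched in Python under strong simplifying assumptions. This is not the CFPI algorithm itself: the `infer` callback stands in for whatever probabilistic inference method is used, the thresholds `low` and `high` are illustrative, and the sketch tracks one probability per type rather than per ground atom. All names are hypothetical.

```python
def coarse_to_fine(root, children, infer, low=0.05, high=0.95):
    """Walk a type hierarchy level by level, pruning near-certain types.

    children: dict mapping a type to the list of its subtypes.
    infer:    callable returning a probability for a type (assumed stand-in).
    Returns a dict from type to False (pruned low), True (pruned high),
    or the leaf probability when no finer level exists.
    """
    decided = {}
    frontier = [root]
    while frontier:
        next_frontier = []
        for t in frontier:
            p = infer(t)
            if p <= low:
                decided[t] = False          # low probability: prune, do not refine
            elif p >= high:
                decided[t] = True           # high probability: prune, do not refine
            elif children.get(t):
                next_frontier.extend(children[t])  # uncertain: refine to subtypes
            else:
                decided[t] = p              # finest level reached
        frontier = next_frontier
    return decided


# Toy hierarchy and made-up probabilities, for illustration only.
hierarchy = {"entity": ["person", "organization"], "person": ["student", "professor"]}
toy_probs = {"entity": 0.5, "person": 0.6, "organization": 0.01,
             "student": 0.97, "professor": 0.3}

print(coarse_to_fine("entity", hierarchy, toy_probs.get))
```

In this toy run, "organization" is pruned at the second level, so none of its subtypes (had there been any) would ever be passed to `infer`; that skipped work is where the speed-up of a coarse-to-fine scheme comes from.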
Leveraging Ontologies for Lifted Probabilistic Inference and Learning
Kiddon, Chloe Marielle (University of Washington) | Domingos, Pedro (University of Washington)
Exploiting ontologies for efficient inference is one of the most widely studied topics in knowledge representation and reasoning. The use of ontologies for probabilistic inference, however, is much less developed. A number of algorithms for lifted inference in first-order probabilistic languages have been proposed, but their scalability is limited by the combinatorial explosion in the sets of objects that need to be considered. We propose a coarse-to-fine inference approach that leverages a class hierarchy to combat this problem. Starting at the highest level, our approach performs inference at successively finer grains, pruning low-probability atoms before refining. We provide bounds on the error incurred by this approach relative to full ground inference as a function of the pruning threshold. We also show how to learn parameters in a coarse-to-fine manner to maximize the opportunities for pruning during inference. Experiments on link prediction and biomolecular event prediction tasks show our method can greatly improve the scalability of lifted probabilistic inference.